Skip to content

feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables#712

Merged
awalker4 merged 13 commits into
mainfrom
austin/plu-161-flatten-metadata-volumes
May 22, 2026
Merged

feat(PLU-161): flatten_metadata option for databricks_volume_delta_tables#712
awalker4 merged 13 commits into
mainfrom
austin/plu-161-flatten-metadata-volumes

Conversation

@awalker4
Copy link
Copy Markdown
Contributor

@awalker4 awalker4 commented May 20, 2026

Summary

  • Adds opt-in flatten_metadata: bool (default False) to DatabricksVolumeDeltaTableUploaderConfig and DatabricksVolumeDeltaTableStagerConfig.
  • When set, the stager recursively flattens nested metadata into top-level columns (no metadata_ prefix) using the existing flatten_dict helper (utils/data_prep.py) — stops at lists; matches Milvus's flat naming.
  • When set, the uploader skips auto-create and intersects incoming columns with the user-managed table schema, dropping unknowns with a log line (mirrors the existing _fit_to_schema message in sql/sql.py).
  • Switches the destination-registry's upload_stager_config from the empty SQL pass-class to the now-non-empty Volumes-specific stager config so the new field surfaces in the dashboard schema. Safe — verified no external imports of the swapped class and no field loss.
  • Backward compatibility is preserved: with the flag absent or False, behavior is byte-identical to today (single metadata VARIANT blob, default auto-create). Workflow DB-serialized configs without the field deserialize cleanly with flatten_metadata=False.
  • No type coercion on the flatten path — values flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields (date_created, date_modified, date_processed, and the data_source_* variants) as STRING; they arrive as stringified or float unix epochs.

Why

Linear: PLU-161. Customers want to query Delta tables via normal SQL columns (`SELECT filename, data_source_url ...`) rather than dotting into a `VARIANT` blob. Per Potter, this is also intended as the reusable pattern for other table-like connectors.

Example job run

All metadata fields are flattened. You can see languages comes in as a list,
Screenshot 2026-05-22 at 1 41 30 PM

Test plan

  • New unit tests: stager blob-mode default, flatten drops the metadata_ prefix, flatten stops at lists, both configs deserialize with default False for Workflow DB backcompat, uploader missing-table hard-fail, uploader PARSE_JSON wrapping on the non-flatten path, uploader unknown-column drop + log.
  • Existing test_databricks_delta_tables.py (SQL connector) and the destination-config JSON-schema test still pass — confirms the registry switch doesn't ripple.
  • SND manual matrix: cases include backcompat, default schema auto-create, flatten with custom table, missing-table hard-fail, nullable columns, nested keys, list-valued columns. Pending plugin-image build off this branch.

🤖 Generated with Claude Code

…bles

Adds an opt-in `flatten_metadata` flag to the Volumes Delta Tables uploader.
When set, the stager recursively flattens the metadata dict into top-level
`metadata_*` columns (stops at lists, preserves the `metadata_` prefix per
the PRD) using the existing `flatten_dict` helper. The uploader skips
auto-create and intersects incoming columns with the user-managed table
schema, dropping unknowns with a log line. Default is False so existing
Workflow DB-serialized configs continue to behave identically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stringified unix epochs (e.g. metadata.data_source.date_processed = "1779329564.5102773")
are rejected by Databricks's implicit string -> TIMESTAMP cast with CAST_INVALID_INPUT.
Run flattened datetime fields through the SQL connector's existing parse_date_string
and emit ISO format so TIMESTAMP columns coerce natively. Malformed values pass through
unchanged so the table can reject them, matching the non-flatten path's semantics.

The known datetime field set is imported from sql.sql._DATE_COLUMNS so this stays in
sync with the rest of ingest as new datetime fields land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@awalker4
Copy link
Copy Markdown
Contributor Author

Finally address unrelated flaky test: #714

Comment thread unstructured_ingest/processes/connectors/databricks/volumes_table.py Outdated
Let metadata fields flow into Delta as their JSON-native types. Customers
designing the user-managed table should declare datetime fields as STRING.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Shadow auto-approve: would not auto-approve because issues were found.
Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread test/unit/connectors/databricks/test_volumes_table.py
… modes

After dropping coercion in 0c31cb2 the existing tests no longer exercised
how datetime metadata flows through to Delta. Pin the new contract: string
and float epochs pass through unchanged in flatten mode, and the blob mode
keeps them byte-identical inside the JSON column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would require human review. This PR adds opt-in metadata flattening to Databricks volume delta tables, modifying core stager and uploader logic with conditional branching, column filtering, and auto-create bypass—changes that carry risk to data integrity and require human review to ensure correctness before merging.

Re-trigger cubic

…issing

Adds a flatten-gated SHOW TABLES check to the Volumes Delta Tables precheck so
a misconfigured workflow fails at save time instead of per-document at run time.
create_destination under flatten mode becomes an early-return no-op so precheck
is the single source of truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. This PR adds an opt-in flatten_metadata feature that modifies stager and uploader logic, including metadata handling, table auto-create skip, and column intersection—changes that affect core ingestion behavior and could have subtle edge cases despite being default-off, so human review is warranted.

Re-trigger cubic

@potter-potter
Copy link
Copy Markdown
Contributor

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude finished @potter-potter's task —— View job


Code Review for PLU-161: flatten_metadata option

Todo List:

  • Read repository guidelines (CLAUDE.md) - None found
  • Review main implementation in volumes_table.py
  • Examine test coverage and quality
  • Check configuration changes
  • Review changelog and version updates
  • Provide comprehensive review feedback

✅ Overall Assessment

This is a well-implemented feature that adds opt-in metadata flattening to Databricks Volume Delta Tables. The implementation is thoughtful, maintains backward compatibility, and includes comprehensive test coverage.

🔍 Detailed Review

Implementation Quality: Excellent

Strengths:

  • Clean Design: The flatten_metadata flag is properly implemented in both stager and uploader configs with sensible defaults (False for backward compatibility)
  • Proper Separation: Stager handles flattening logic, uploader handles schema intersection - good separation of concerns
  • Error Handling: Precheck validation ensures table exists when flatten mode is enabled, preventing runtime failures
  • Logging: Appropriate info-level logging when dropping unknown columns

Key Implementation Details:

  • Stager flattens metadata using existing flatten_dict helper with flatten_lists=False - volumes_table.py:88-95
  • Uploader intersects columns with destination schema and logs drops - volumes_table.py:219-227
  • Auto-create is disabled in flatten mode as documented - volumes_table.py:117-119
  • Backward compatibility preserved through False default - volumes_table.py:44-50

Test Coverage: Comprehensive

The test suite covers all critical paths:

  • ✅ Default blob mode behavior (test_stager_blob_mode_is_default)
  • ✅ Metadata flattening without prefixes (test_stager_flatten_drops_metadata_prefix)
  • ✅ List preservation (test_stager_flatten_stops_at_lists)
  • ✅ Datetime passthrough contracts for both string and float epochs
  • ✅ Configuration backward compatibility (test_flatten_metadata_defaults_false_for_workflow_db_backcompat)
  • ✅ Uploader behaviors: precheck validation, column intersection, SQL generation
  • ✅ Error cases: missing tables, unknown columns

The tests properly verify the datetime handling contract mentioned in the PR description.

Security & Robustness: Good

  • No SQL injection risks - uses parameterized queries and proper quoting via quote_identifier
  • Fail-fast precheck prevents silent data loss in flatten mode
  • Column intersection prevents schema drift issues
  • No secrets or sensitive data handling concerns

Documentation & Usability: Clear

  • Field descriptions clearly explain the feature and requirements
  • Changelog entry is accurate and informative
  • Error messages are descriptive (e.g., "auto-create is disabled to prevent silent data loss")

🔧 Technical Notes

  1. Schema Intersection Logic: The uploader correctly filters incoming columns against destination schema (volumes_table.py:220-221), which is essential for user-managed tables.

  2. PARSE_JSON Handling: The implementation correctly wraps metadata with PARSE_JSON() only in non-flatten mode (volumes_table.py:231), maintaining the existing VARIANT behavior.

  3. Datetime Contract: No type coercion occurs in flatten mode - values flow through as their JSON-native types (strings or floats), which is the intended behavior per the PR description.

⚠️ Minor Observations

  1. Registry Switch: The destination registry now uses the Volumes-specific stager config instead of the empty SQL pass-class. This is safe as verified by the author, but worth noting for future maintenance.

  2. Error Context: The precheck error message could include the actual catalog/database context, but the current message is sufficiently clear.

✅ Approval Recommendation

This PR is ready for merge. The implementation is solid, well-tested, and follows established patterns in the codebase. The feature addresses a clear customer need while maintaining full backward compatibility.

The addressing of previous review comments (float type handling, test coverage improvements) demonstrates attention to feedback and code quality.


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 2 files (changes from recent commits).

Shadow auto-approve: would require human review. Adds a new flatten_metadata feature that modifies core stager and uploader logic, including conditional auto-create skipping, schema intersection, and precheck queries, which introduces non-trivial business logic risk and requires human review for correctness and safety.

Re-trigger cubic

@potter-potter
Copy link
Copy Markdown
Contributor

@claude please give one more review. we dropped the datetime and timestamp conversions because we dont want to do that anymore

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Claude finished @potter-potter's task —— View job


Code Review for PLU-161 (Updated Review)

Todo List:

  • Read current implementation files to understand datetime changes
  • Review test coverage for datetime handling contract
  • Check for any remaining datetime conversion logic
  • Provide updated review feedback based on dropped conversions

✅ Updated Review - Datetime/Timestamp Conversion Removal

After reviewing the current implementation, I can confirm that the datetime and timestamp conversions have been properly removed. Here's my updated assessment:

Key Changes Confirmed:

  1. No Datetime Coercion Logic: The implementation in volumes_table.py has no datetime conversion code. Values flow through unchanged as their JSON-native types (strings or floats).

  2. Test Coverage Updated: The test suite now properly covers the datetime pass-through contract:

    • test_stager_flatten_passes_datetime_strings_through_unchanged() - verifies string epochs remain as strings
    • test_stager_flatten_passes_float_epoch_through_unchanged() - verifies numeric epochs remain as floats
    • test_stager_blob_mode_preserves_datetime_strings() - verifies non-flatten path is unchanged
  3. Clean Implementation: The stager simply calls flatten_dict() with flatten_lists=False and no additional processing (volumes_table.py:88-95).

Implementation Quality: Excellent

Strengths of the Current Approach:

  • Simplicity: No complex datetime parsing or conversion logic to maintain
  • Performance: No overhead from type checking and conversion
  • Flexibility: Customers control datetime field types in their Delta table schema
  • Consistency: Both string and float epochs are preserved exactly as received

Documentation & Contract:

The PR description and field descriptions clearly state the new contract:

"No type coercion on the flatten path — values flow into Delta as their JSON-native types. Customers designing the user-managed table should declare datetime fields (date_created, date_modified, date_processed, and the data_source_* variants) as STRING"

Test Coverage: Comprehensive

The updated test suite properly validates:

  • ✅ String epochs pass through unchanged: "1779329564.5102773""1779329564.5102773"
  • ✅ Float epochs pass through unchanged: 1779329564.51027731779329564.5102773
  • ✅ Type preservation: isinstance(row["data_source_date_processed"], float) assertion
  • ✅ Backward compatibility: blob mode behavior is unchanged

✅ Final Recommendation

The datetime/timestamp conversion removal is properly implemented. The changes:

  1. Eliminated complexity - No more datetime parsing/conversion logic to maintain
  2. Improved performance - No type checking overhead in the hot path
  3. Enhanced flexibility - Customers have full control over datetime field handling
  4. Maintained backward compatibility - Non-flatten mode is unchanged
  5. Clear contract - Well-documented expectation that customers declare datetime fields as STRING

This approach is cleaner and more maintainable than the previous datetime coercion approach. The implementation correctly fulfills the requirement to remove datetime/timestamp conversions while preserving all other functionality.

Status: ✅ Ready for merge


Copy link
Copy Markdown
Contributor

@potter-potter potter-potter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants